Added ANTLR parse tree persistent caching for procedures #4547

Draft
manisha-deshpande wants to merge 13 commits into babelfish-for-postgresql:BABEL_5_X_DEV from amazon-aurora:jira-babel-6037

@manisha-deshpande manisha-deshpande commented Feb 6, 2026

Description

BABEL-6037: Cross-session ANTLR parse tree caching for T-SQL stored procedures.

Large stored procedures (e.g., ~1300 lines) are slow on their first execution in each new session because ANTLR parses them again. The PLtsql function hash table is session-scoped, so every new session re-parses from scratch.

This PR introduces persistent ANTLR parse tree caching. Serialized parse trees are stored in new columns in sys.babelfish_function_ext using PostgreSQL's nodeToString()/stringToNode() framework. On first execution in a new session, the cached result is deserialized instead of re-running ANTLR, with version and modify-date validation so that stale entries are not served.

Issues Resolved

BABEL-6037

Changes

Serialization Infrastructure

  • gen_pltsql_node_support.pl code generator (modeled after PG's gen_node_support.pl) produces pltsql_nodetags.h, pltsql_outfuncs_defs.c, and pltsql_readfuncs_defs.c from annotated header files (pltsql_serializable_1.h and pltsql_serializable_2.h, which are copies of pltsql.h and pltsql-2.h)
  • Extension-owned T_PLtsql_* NodeTag values offset from 1000 to avoid collision with engine NodeTag enum, with an ABI stability check that fails the build if node types are added without updating the last nodetag constants
  • Wrapper files pltsql_outfuncs.c and pltsql_readfuncs.c expose pltsql_outNode() and pltsql_parseNodeString() dispatch functions
  • pltsql_serialize_macros.h replicates engine-internal WRITE_*/READ_* macros
  • pltsql_node_stubs.c provides custom read/write handlers for nodes requiring special serialization logic (flexible array members, runtime-only fields, string/int arrays)
    • Redefines the necessary WRITE_*/READ_* macros locally; they are not exposed in any PG header and are taken from postgresql_modified_for_babelfish/src/backend/nodes/readfuncs.c and outfuncs.c
    • Custom read/write implementations for PLtsql nodes marked custom_read_write:
      • PLtsql_expr — skips runtime-only fields (plan, func, expr_simple_*)
      • PLtsql_nsitem — handles FLEXIBLE_ARRAY_MEMBER for name[]
      • PLtsql_row — handles string/int arrays (fieldnames, varnos) with array_size(nfields)
      • PLtsql_recfield — skips runtime cache fields (rectupledescid, finfo)
  • pl_handler.c registers outNode_hook and parseNodeString_hook in _PG_init() so the engine delegates to extension code for PLtsql node types
  • On serialization errors, the code falls back gracefully to a full ANTLR recompile.
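
To illustrate why a node with a flexible array member needs a custom read/write handler, here is a minimal standalone sketch. It is not the PR's code and does not use PostgreSQL's actual nodeToString() text format: the NsItem struct, the nsitem_write()/nsitem_read() functions, and the "{NSITEM ...}" layout are all hypothetical. The point it shows is that the reader must learn the name length before it can allocate the struct, and that malformed input yields NULL so a caller could fall back to a full re-parse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a namespace item whose name is stored inline. */
typedef struct NsItem
{
    int  itemtype;
    int  itemno;
    char name[];    /* flexible array member: size known only at runtime */
} NsItem;

/*
 * Write the item as "{NSITEM <itemtype> <itemno> <namelen> <name>}".
 * Returns what snprintf returns, so a result >= buflen means truncation.
 */
static int
nsitem_write(const NsItem *item, char *buf, size_t buflen)
{
    assert(item != NULL && buf != NULL);
    return snprintf(buf, buflen, "{NSITEM %d %d %zu %s}",
                    item->itemtype, item->itemno,
                    strlen(item->name), item->name);
}

/*
 * Read an item back. The length is parsed first so the struct and its
 * trailing name can be allocated in one block. Returns NULL on malformed
 * input or allocation failure.
 */
static NsItem *
nsitem_read(const char *text)
{
    int     itemtype;
    int     itemno;
    int     consumed = 0;
    size_t  namelen;
    size_t  remaining;
    NsItem *item;

    if (sscanf(text, "{NSITEM %d %d %zu%n",
               &itemtype, &itemno, &namelen, &consumed) != 3)
        return NULL;

    /* Expect one space, then namelen bytes of name, then '}' */
    remaining = strlen(text + consumed);
    if (remaining < 2 || namelen > remaining - 2 ||
        text[consumed] != ' ' || text[consumed + 1 + namelen] != '}')
        return NULL;

    item = malloc(sizeof(NsItem) + namelen + 1);
    if (item == NULL)
        return NULL;

    item->itemtype = itemtype;
    item->itemno = itemno;
    memcpy(item->name, text + consumed + 1, namelen);
    item->name[namelen] = '\0';
    return item;
}
```

Because the name is length-prefixed, it may contain spaces or braces without confusing the reader; a fixed-layout reader generated from the struct definition alone could not know how many bytes to allocate.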

Catalog Changes

Five new columns in sys.babelfish_function_ext:

  • antlr_parse_tree_text TEXT — serialized parse tree (nodeToString output)
  • antlr_parse_tree_datums TEXT — serialized datum array
  • antlr_parse_tree_modify_date DATETIME — timestamp for staleness detection
  • antlr_parse_tree_bbf_version TEXT — Babelfish version at serialization time
  • antlr_cache_enabled BOOL — per-function cache flag (default false)

Upgrade SQL in babelfishpg_tsql--5.5.0--5.6.0.sql adds these columns with an allow_system_table_mods guard for CI compatibility.

Cache Lifecycle

  • CREATE/ALTER PROCEDURE: Compiles the procedure, serializes the parse tree to babelfish_function_ext, and populates the in-session PLtsql hash table
  • 1st EXEC in new session: Hash table miss → deserializes from babelfish_function_ext, validates bbf_version and modify_date, populates hash table
  • 2nd+ EXEC in same session: Hash table hit (no deserialization needed)
  • ALTER with GUC disabled: Sets parse tree columns to NULL to force reparsing
  • DROP PROCEDURE: Removes the catalog row and per-function flag
  • MVU (Major Version Upgrade): Version mismatch rejects cache entries and forces ANTLR parsing
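
The lookup order described in the list above can be sketched as a small standalone model. This is an illustration under stated assumptions, not the PR's implementation: CatalogRow, Session, catalog_entry_usable(), and get_compiled_procedure() are hypothetical names, the session hash table is reduced to a single flag, and the modify-date check is assumed to compare the date stored with the parse tree against the procedure's current modify date.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    FROM_SESSION_HASH,      /* 2nd+ EXEC in the same session */
    FROM_CATALOG_CACHE,     /* 1st EXEC, valid serialized tree found */
    FROM_ANTLR_PARSE        /* no usable entry: full parse */
} CompileSource;

/* Hypothetical model of the cache-related part of a catalog row. */
typedef struct CatalogRow
{
    const char *parse_tree_text;    /* NULL when nothing is cached */
    const char *cached_bbf_version; /* version recorded at serialization */
    long        cached_modify_date; /* modify date recorded with the tree */
    long        modify_date;        /* procedure's current modify date */
} CatalogRow;

/* Stands in for the session-scoped PLtsql function hash table. */
typedef struct Session
{
    bool compiled_in_hash;
} Session;

/* A cached tree is usable only if it exists and passes both checks. */
static bool
catalog_entry_usable(const CatalogRow *row, const char *running_version)
{
    if (row->parse_tree_text == NULL || row->cached_bbf_version == NULL)
        return false;
    if (strcmp(row->cached_bbf_version, running_version) != 0)
        return false;   /* e.g. after a major version upgrade */
    return row->cached_modify_date == row->modify_date;
}

static CompileSource
get_compiled_procedure(Session *s, const CatalogRow *row,
                       const char *running_version)
{
    assert(s != NULL && row != NULL && running_version != NULL);

    if (s->compiled_in_hash)
        return FROM_SESSION_HASH;

    /* Either path below ends with the session hash table populated. */
    s->compiled_in_hash = true;

    if (catalog_entry_usable(row, running_version))
        return FROM_CATALOG_CACHE;
    return FROM_ANTLR_PARSE;
}
```

In this model a version mismatch, a modify-date mismatch, and a NULL parse tree column all lead to the same outcome, a full ANTLR parse, which matches the "reject and re-parse" behavior listed for ALTER with the GUC disabled and for MVU.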

GUC Configuration

  • babelfishpg_tsql.enable_routine_parse_cache (session-level, PGC_USERSET, default false) — global toggle for enabling/disabling cache reads and writes
  • sys.enable_routine_parse_cache(func_identifier TEXT, enable_flag BOOLEAN) — per-function granularity; accepts schema.func, schema.func(argtypes), or func (defaults to dbo); disabling NULLs out cache columns for immediate invalidation
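
As a rough illustration of the three identifier forms accepted by the per-function toggle, the following standalone sketch splits an identifier into schema and function name and defaults the schema to dbo. The function split_routine_identifier() is hypothetical and is not taken from the PR; it ignores the argument-type list and does not handle quoted or bracketed identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Split "schema.func", "schema.func(argtypes)", or "func" into its parts.
 * A missing schema defaults to "dbo". Returns false on empty parts or
 * when an output buffer is too small.
 */
static bool
split_routine_identifier(const char *ident,
                         char *schema, size_t schema_len,
                         char *func, size_t func_len)
{
    const char *paren;
    const char *dot;
    size_t      name_len;
    int         n;

    assert(ident != NULL && schema != NULL && func != NULL);

    /* Only the part before "(" can contain the schema separator. */
    paren = strchr(ident, '(');
    name_len = paren ? (size_t) (paren - ident) : strlen(ident);
    dot = memchr(ident, '.', name_len);

    if (dot != NULL)
    {
        size_t slen = (size_t) (dot - ident);
        size_t flen = name_len - slen - 1;

        if (slen == 0 || flen == 0)
            return false;
        n = snprintf(schema, schema_len, "%.*s", (int) slen, ident);
        if (n < 0 || (size_t) n >= schema_len)
            return false;
        n = snprintf(func, func_len, "%.*s", (int) flen, dot + 1);
    }
    else
    {
        if (name_len == 0)
            return false;
        n = snprintf(schema, schema_len, "dbo");
        if (n < 0 || (size_t) n >= schema_len)
            return false;
        n = snprintf(func, func_len, "%.*s", (int) name_len, ident);
    }
    return n >= 0 && (size_t) n < func_len;
}
```

Restricting the dot search to the text before the opening parenthesis keeps a schema-qualified argument type from being mistaken for the routine's schema.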

Build System

  • Makefile updated to compile serialization wrapper .o files with dependency rules for generated files from gen_pltsql_node_support.pl
  • Generated files added to .gitignore

Node Allocation

  • All PLtsql node palloc0() calls replaced with makeNode() to set proper NodeTag values required by the serialization framework
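
The reason a zeroed allocation is not enough can be shown with a small standalone sketch. It mimics the idea behind makeNode() (zeroed allocation plus setting the tag) and the offset tag range, but every name here (ExtNodeTag, Ext_StmtAssign, ext_make_node, ext_node_label) is hypothetical, and the compile-time check is only an analogue of the generator's ABI stability check, not a copy of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical extension-owned tags, starting above the engine's range. */
#define EXT_NODETAG_BASE     1000
#define EXT_LAST_NODETAG_NO  1002   /* must be updated when a tag is added */

typedef enum ExtNodeTag
{
    T_Ext_Invalid = 0,
    T_Ext_StmtBlock = EXT_NODETAG_BASE,
    T_Ext_StmtAssign,
    T_Ext_StmtReturn,
    T_Ext_End                       /* sentinel, always last */
} ExtNodeTag;

/* Adding a tag without updating EXT_LAST_NODETAG_NO breaks the build. */
_Static_assert(T_Ext_End - 1 == EXT_LAST_NODETAG_NO,
               "node tag added without updating EXT_LAST_NODETAG_NO");

/* Every node starts with its tag so generic code can dispatch on it. */
typedef struct ExtNode
{
    ExtNodeTag type;
} ExtNode;

typedef struct Ext_StmtAssign
{
    ExtNodeTag type;
    int        varno;
} Ext_StmtAssign;

/* Zeroed allocation plus tag assignment, in the spirit of makeNode(). */
static void *
ext_new_node(size_t size, ExtNodeTag tag)
{
    ExtNode *node;

    assert(size >= sizeof(ExtNode));
    node = calloc(1, size);
    if (node != NULL)
        node->type = tag;
    return node;
}

#define ext_make_node(_type_) \
    ((_type_ *) ext_new_node(sizeof(_type_), T_##_type_))

/* Serializer-style dispatch: returns NULL for nodes it cannot identify. */
static const char *
ext_node_label(const void *node)
{
    switch (((const ExtNode *) node)->type)
    {
        case T_Ext_StmtBlock:
            return "STMT_BLOCK";
        case T_Ext_StmtAssign:
            return "STMT_ASSIGN";
        case T_Ext_StmtReturn:
            return "STMT_RETURN";
        default:
            return NULL;    /* an untagged (all-zero) node lands here */
    }
}
```

A node obtained from a plain zeroed allocation carries tag 0, so the dispatch function cannot tell what it is, while a node created through ext_make_node(Ext_StmtAssign) is labeled correctly.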

Performance Results

  • Customer procedure (~1300 lines): 2031ms → 15ms first-execution time (99% reduction)
  • Stress test (1000 connections): Average first-execution time ~4314ms → ~175ms
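
The reduction percentages can be checked with a one-line formula. The helper below is only a worked calculation over the timings quoted above; percent_reduction() is a hypothetical name and nothing here is measured.

```c
#include <assert.h>

/* Relative reduction in percent: (before - after) / before * 100. */
static double
percent_reduction(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

Calling percent_reduction(2031.0, 15.0) gives roughly 99.3, consistent with the stated 99% reduction, and percent_reduction(4314.0, 175.0) gives roughly 95.9 for the stress-test averages.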

Test Scenarios Covered

[TBD]

  • Use case based -

  • Boundary conditions -

  • Arbitrary inputs -

  • Negative test cases -

  • Minor version upgrade tests -

  • Major version upgrade tests -

  • Performance tests -

  • Tooling impact -

  • Client tests -

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.

For more information on following Developer Certificate of Origin and signing off your commits, please check here.

create_date SYS.DATETIME NOT NULL,
modify_date SYS.DATETIME NOT NULL,
definition sys.NTEXT DEFAULT NULL,
antlr_parse_tree JSONB DEFAULT NULL, -- JSONB serialized ANTLR parse tree for caching

Contributor Author

babelfish_function_ext regression test fails.
Column can be queried from postgres side, not babelfish side due to unsupported datatype.
Should the column type be TEXT instead? Would that affect storage space?

--- /home/runner/work/babelfish_extensions/babelfish_extensions/test/JDBC/./expected/babelfish_function_ext-vu-cleanup.out	2026-02-06 18:16:42.761144533 +0000
+++ /home/runner/work/babelfish_extensions/babelfish_extensions/test/JDBC/./output/babelfish_function_ext-vu-cleanup.out	2026-02-06 18:42:00.808212640 +0000
@@ -52,7 +52,7 @@
 -- babelfish_function_ext entry should have been removed after dropping all these functions/procedure
 SELECT * FROM sys.babelfish_function_ext WHERE funcname LIKE 'babel_2877_vu_prepare%';
 GO
-~~START~~
-varchar#!#varchar#!#nvarchar#!#text#!#text#!#bigint#!#bigint#!#datetime#!#datetime#!#ntext
-~~END~~
+~~ERROR (Code: 33557097)~~
+
+~~ERROR (Message: data type jsonb is not supported yet)~~

Contributor

You can ignore this error for now. We can later add a TDS sender function for JSONB (which just sends it as JSON)

Should the column type be TEXT instead? Would that affect storage space?

Yes, JSONB allows fast lookups, whereas JSON/TEXT would require a deserialization step of its own, which becomes a problem for bigger procedures.

*
* This header provides the interface for serializing and deserializing
* ANTLR PLtsql parse trees to/from JSONB format. The serialized data is
* stored in the cross-session cache (babelfish_func_ext) to enable faster

Contributor

it's a catalog rather than a cache - even though the catalog will be cached


/* Read the value for this key */
tok = JsonbIteratorNext(&ctx.it, &v, false);

Contributor

why overwrite tok?

@robverschoor robverschoor left a comment

added some comments

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
…ble and type

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
…E, TRY Statements

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
…d retrieval

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Add cross-session ANTLR parse tree caching for T-SQL stored procedures.
Serialized parse trees and datums are stored in babelfish_function_ext
using nodeToString/stringToNode. On procedure execution, cached results
are restored to skip ANTLR re-parsing. Cache reads validate the stored
bbf_version and modify_date before deserializing, skipping stale entries
from different Babelfish versions or procedures modified with the GUC
disabled.

Changes:
- Add antlr_parse_tree_text, antlr_parse_tree_datums,
  antlr_parse_tree_modify_date, and antlr_parse_tree_bbf_version columns
  to sys.babelfish_function_ext
- Store serialized parse tree and version in
  pltsql_store_func_default_positions
- Restore and validate cached parse tree in new function
  pltsql_restore_func_parse_result invoked prior to ANTLR parse

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Add bbf_version validation, exec-time cache repopulation, and
rename/alter/dependency invalidation logic and tests

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Move PLtsql outfuncs/readfuncs code generation entirely to the extension,
eliminating the need for PLtsql-specific headers in the engine's
gen_node_support.pl input files.

Key changes:

- gen_pltsql_node_support.pl now generates pltsql_nodetags.h with
  extension-owned T_PLtsql_* NodeTag values (offset from 1000 to avoid
  collision with engine's NodeTag enum). Includes ABI stability check
  that fails the build if node types are added without updating
  $last_nodetag/$last_nodetag_no.

- Wrapper files pltsql_outfuncs.c and pltsql_readfuncs.c mirror the
  engine's pattern: #include the generated static functions and switch
  fragments, expose public pltsql_outNode() and pltsql_parseNodeString()
  dispatch functions.

- pltsql_serialize_macros.h provides WRITE_*/READ_* macros replicated
  from engine internals (not exposed in any PG header).

- pl_handler.c registers outNode_hook and parseNodeString_hook in
  _PG_init() so the engine's outNode()/parseNodeString() delegate to
  extension code for PLtsql node types.

- pltsql.h includes generated pltsql_nodetags.h for T_PLtsql_* defines.

- Makefile updated: compiles wrapper .o files (not gen .o directly),
  with proper dependency rules for generated files.

Files changed:
  src/pltsql_serialize/gen_pltsql_node_support.pl  - nodetags generation + ABI check
  src/pltsql_serialize/pltsql_outfuncs.c           - new wrapper
  src/pltsql_serialize/pltsql_readfuncs.c          - new wrapper
  src/pltsql_serialize/pltsql_serialize_macros.h   - shared macros
  src/pltsql_serialize/pltsql_node_stubs.c         - custom read/write nodes
  src/pltsql.h                                     - include pltsql_nodetags.h
  src/pl_handler.c                                 - register hooks
  Makefile                                         - build rules

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
Add sys.enable_routine_parse_cache(TEXT, BOOLEAN) to enable or disable
ANTLR parse tree caching for individual functions. Complements the
existing global session GUC with per-function granularity.

Changes:
- New antlr_cache_enabled column in babelfish_function_ext (default false)
- Function accepts schema.func, schema.func(argtypes), or func (dbo default)
- Returns BOOLEAN confirming the flag that was set
- Disabling NULLs out cache columns for immediate invalidation
- ALTER PROCEDURE preserves the per-function flag
- DROP PROCEDURE removes the flag with the row
- allow_system_table_mods guard in upgrade SQL for CI compatibility
- Tests covering full signature, simple name, no-schema, custom schema,
  error cases, ALTER preservation, and DROP cleanup

Task: BABEL-6037
Signed-off-by: Manisha Deshpande <mmdeshp@amazon.com>
3 participants